首页> 外文OA文献 >Scalable Locality-Sensitive Hashing for Similarity Search in High-Dimensional, Large-Scale Multimedia Datasets
【2h】

Scalable Locality-Sensitive Hashing for Similarity Search in High-Dimensional, Large-Scale Multimedia Datasets

机译:用于相似性搜索的可扩展的局部敏感哈希   高维,大规模多媒体数据集

摘要

Similarity search is critical for many database applications, including theincreasingly popular online services for Content-Based Multimedia Retrieval(CBMR). These services, which include image search engines, must handle anoverwhelming volume of data, while keeping low response times. Thus,scalability is imperative for similarity search in Web-scale applications, butmost existing methods are sequential and target shared-memory machines. Here weaddress these issues with a distributed, efficient, and scalable index based onLocality-Sensitive Hashing (LSH). LSH is one of the most efficient and populartechniques for similarity search, but its poor referential locality propertieshas made its implementation a challenging problem. Our solution is based on awidely asynchronous dataflow parallelization with a number of optimizationsthat include a hierarchical parallelization to decouple indexing and datastorage, locality-aware data partition strategies to reduce message passing,and multi-probing to limit memory usage. The proposed parallelization attainedan efficiency of 90% in a distributed system with about 800 CPU cores. Inparticular, the original locality-aware data partition reduced the number ofmessages exchanged in 30%. Our parallel LSH was evaluated using the largestpublic dataset for similarity search (to the best of our knowledge) with $10^9$128-d SIFT descriptors extracted from Web images. This is two orders ofmagnitude larger than datasets that previous LSH parallelizations could handle.
机译:相似性搜索对于许多数据库应用程序都是至关重要的,其中包括基于内容的多媒体检索(CBMR)越来越流行的在线服务。这些服务(包括图像搜索引擎)必须处理大量数据,同时保持较低的响应时间。因此,可伸缩性对于Web规模的应用程序中的相似性搜索是必不可少的,但是大多数现有方法是顺序的和目标共享内存的机器。在这里,我们通过基于局部敏感哈希(LSH)的分布式,高效且可扩展的索引解决了这些问题。 LSH是用于相似度搜索的最有效,最流行的技术之一,但是它较差的引用局部性使它的实施成为一个具有挑战性的问题。我们的解决方案基于广泛的异步数据流并行化,并进行了许多优化,包括使索引和数据存储脱钩的分层并行化,减少消息传递的位置感知数据分区策略以及限制内存使用的多重探测。建议的并行化在具有约800个CPU内核的分布式系统中实现了90%的效率。特别是,原始的可感知位置的数据分区将交换的消息数量减少了30%。我们使用最大的公共数据集(根据我们所知),使用从网络图像中提取的$ 10 ^ 9 $ 128-d SIFT描述符,对并行LSH进行了评估。这比以前的LSH并行化可以处理的数据集大两个数量级。

著录项

相似文献

  • 外文文献
  • 中文文献
  • 专利

客服邮箱:kefu@zhangqiaokeyan.com

京公网安备:11010802029741号 ICP备案号:京ICP备15016152号-6 六维联合信息科技 (北京) 有限公司©版权所有
  • 客服微信

  • 服务号